Image spam filtering based on senders' intention analysis

ABSTRACT

Systems and methods for an anti-spam detection module that can detect image spam are provided. According to one embodiment, an image spam detection process involves determining and measuring various characteristics of images that may be embedded within or otherwise associated with an electronic mail (email) message. An approximate display location of the embedded images is determined. The existence of one or more abnormal factors associated with the embedded images is identified. A quantity of text included in the one or more embedded images is determined and measured by analyzing one or more blocks of binarized representations of the one or more embedded images. Finally, the likelihood that the email message is spam is determined based on one or more of the approximate display location, the existence of one or more abnormal factors and the quantity and location of text measured.

COPYRIGHT NOTICE

Contained herein is material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights to the copyright whatsoever. Copyright © 2007, Fortinet, Inc.

BACKGROUND

1. Field

Embodiments of the present invention generally relate to the field of spam filtering and anti-spam techniques. In particular, various embodiments relate to image analysis and methods for combating spam in which spammers use images to carry the advertising text.

2. Description of the Related Art

Image spam was originally created in order to get past heuristic filters, which block messages containing words and phrases commonly found in spam. Since image files have different formats than the text found in the message body of an electronic mail (email) message, conventional heuristic filters, which analyze such text, do not detect the content of the message, which may be partly or wholly conveyed by embedded text within the image. As a result, heuristic filters were easily defeated by image spam techniques.

To address this spamming technique, fuzzy signature technologies, which flag both known and similar messages as spam, were deployed by anti-spam vendors. Such fuzzy signature technologies allowed message attachments to be targeted, thereby recognizing as spam messages with different content but the same attachment.

Spammers now alter the images to make the email message appear different to signature-based filtering approaches while maintaining readability of the embedded text message to human viewers. The content of images lies in two levels: (i) the pixel matrix and (ii) the text or graphics these pixel matrices represent. At present, the notion of pixel-based matching does not make sense, as the same text could be represented by countless pixel matrices by simply changing various attributes, such as the font, size or color, or by adding noise. Therefore, hash matching and other signature-based approaches have essentially been rendered useless to block image spam as they fail as a result of even minor changes to the background of the image.
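By way of non-limiting illustration, the fragility of hash matching noted above can be sketched as follows. The sketch is not drawn from any particular vendor's signature scheme; the function name `image_signature`, the tiny 4x4 pixel matrix and the use of SHA-256 are assumptions chosen purely for illustration.

```python
import hashlib


def image_signature(pixels):
    """Hash a grayscale pixel matrix (rows of 0-255 integers).

    Hypothetical helper: stands in for an exact-match signature.
    """
    raw = bytes(value for row in pixels for value in row)
    return hashlib.sha256(raw).hexdigest()


# Hypothetical 4x4 image: a dark square on a white background.
original = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]

# The same image with a single background pixel changed by one
# intensity level, a difference no human viewer would notice.
altered = [row[:] for row in original]
altered[0][0] = 254

sig_original = image_signature(original)
sig_altered = image_signature(altered)

# The two signatures no longer match, so an exact-match blacklist
# keyed on sig_original would miss the altered copy.
signatures_match = sig_original == sig_altered
```

In this sketch a one-level change to one background pixel yields a different digest, which is the behavior that renders exact signature matching ineffective against randomized images.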

Some vendors have attempted to catch image spam by employing Optical Character Recognition (OCR) techniques; however, such approaches have only limited success in view of spammers' use of techniques to obscure the embedded text messages with a variety of noise. FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques. As shown in FIGS. 1A and 1B, polygons, lines, random colors, jagged text, random dots, varying borders and the like may be inserted into image spam in an attempt to defeat signature detection techniques and obscure the embedded text from OCR techniques. There are an almost infinite number of ways that spammers can randomize images. In addition to the foregoing obfuscation techniques, spammers have recently used techniques such as varying the colors used in an image, changing the width and/or pattern of the border, altering the font style, and slicing images into smaller pieces (which are then reassembled to appear as a single image to the recipient). Meanwhile, OCR is very computationally expensive. Depending upon the implementation, fully rendering a message and then looking for word matches against different character set libraries may take as long as several seconds per message, which is typically unacceptable for many contexts.

SUMMARY

Systems and methods are described for an anti-spam detection module that can detect image spam. According to one embodiment, one or more of the quantity and position of text within an image associated with an electronic message are measured or estimated. Then, based at least in part on the results of the measuring or estimating, the likelihood that the electronic message is spam is determined.

According to another embodiment, an embedded image of an electronic mail (email) message is converted to a binarized representation by performing thresholding on a grayscale representation of the embedded image. A quantity of text included in the embedded image is then determined and measured by analyzing one or more blocks of the binarized representation. Finally, the email message is classified as spam or clean based at least in part on the quantity of text measured.
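By way of non-limiting illustration, the thresholding and block analysis summarized above might be sketched as follows. This is a simplified sketch under stated assumptions, not the measurement procedure of any particular embodiment: the mean-intensity threshold, the block size, the density bounds `low` and `high`, and all function names are hypothetical choices made only to show the shape of the computation.

```python
def to_binary(gray, threshold=None):
    """Threshold a grayscale matrix; 1 marks a dark (foreground) pixel."""
    if threshold is None:
        flat = [v for row in gray for v in row]
        threshold = sum(flat) / len(flat)  # assumption: mean intensity
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def block_densities(binary, block_h, block_w):
    """Fraction of foreground pixels in each block, in row-major order."""
    densities = []
    for top in range(0, len(binary), block_h):
        for left in range(0, len(binary[0]), block_w):
            cells = [
                v
                for row in binary[top:top + block_h]
                for v in row[left:left + block_w]
            ]
            densities.append(sum(cells) / len(cells))
    return densities


def text_block_ratio(gray, block_h=2, block_w=2, low=0.1, high=0.6):
    """Share of blocks whose foreground density looks text-like.

    Assumption: a block is treated as text-like when its density lies
    between `low` and `high`; real text detection would be more involved.
    """
    densities = block_densities(to_binary(gray), block_h, block_w)
    text_like = [d for d in densities if low <= d <= high]
    return len(text_like) / len(densities)


# Hypothetical 4x4 grayscale image with two dark pixels.
gray = [
    [250, 250, 250, 250],
    [250, 10, 250, 250],
    [250, 250, 250, 250],
    [250, 250, 250, 10],
]
ratio = text_block_ratio(gray)
```

For the sample matrix, two of the four 2x2 blocks contain a dark pixel, so `ratio` evaluates to 0.5. A classifier could compare such a ratio against a configurable threshold when deciding between spam and clean.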

In one embodiment, the embedded image may be formatted in accordance with the Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG) or Portable Network Graphics (PNG) formats/standards.

In one embodiment, the embedded image may be an image contained within a file attached to the email message.

In one embodiment, the method also includes determining an approximate display location of an embedded image within the email message and identifying existence of one or more abnormal factors associated with the embedded image. Then, the classification can be based upon the approximate display location, the existence of one or more abnormal factors as well as the quantity of text measured.

Other features of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques.

FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed.

FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system with a client workstation and an email server in accordance with an embodiment of the present invention.

FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized.

FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention.

FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention.

FIG. 7 is an example of an image spam email message containing an embedded image.

FIG. 8 is a grayscale image based on the embedded image of FIG. 7 according to one embodiment of the present invention.

FIG. 9 is an intensity histogram for the grayscale image of FIG. 8 according to one embodiment of the present invention.

FIG. 10 is a binary image resulting from thresholding the grayscale image of FIG. 8 in accordance with an embodiment of the present invention.

FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.

FIG. 12 is a grayscale image based on another exemplary embedded image observed in connection with image spam.

FIG. 13 illustrates an exemplary segmentation into 56 virtual blocks of a binarized image corresponding to the image of FIG. 12 and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

Systems and methods are described for an anti-spam detection module that can detect various forms of image spam. According to one embodiment, images attached to or embedded within email messages are analyzed to determine the senders' intention. Empirical analysis reveals legitimate emails may contain embedded images, but valid images sent through email rarely contain a substantial quantity of text. Additionally, when legitimate images are included within email messages, the senders of such email messages do not painstakingly adjust the location of such included images to assure such images appear in the preview window of an email client. Furthermore, legitimate senders do not intentionally inject noise into the embedded images. In contrast, spammers usually compose email messages in different ways. For example, in the context of image spam, spammers insert text into images to avoid filtering by traditional text filters and employ techniques to randomize images and/or obscure text embedded within images. Spammers also typically make great efforts to draw attention to their image spam by carefully placing the image in such a manner as to make it visible to the recipient in the preview window/pane of an email client that supports HTML email, such as Netscape Messenger or Microsoft Outlook. Consequently, various indicators of image spam include, but are not limited to, inclusion of one or more images in the front part of an email message, inclusion of one or more images containing text meeting a certain threshold and/or inclusion of one or more images into which noise appears to have been injected to obfuscate embedded text.

According to one embodiment, various image analysis techniques are employed to more accurately detect image spam based on senders' intention analysis. The goal of senders' intention analysis is to discover the email message sender's intent by examining various characteristics of the email message and the embedded or attached images. If it appears, for example, after performing image analysis that one or more images associated with an email message have had one or more obfuscation techniques applied, that the intent is to draw attention to the one or more images and/or that the one or more images include suspicious quantities of text, then the sender's intention analysis anti-spam processing may flag the email message at issue as spam. In one embodiment, the image scanning spam detection method is based on a combination of email header analysis, email body analysis and image processing on image attachments.
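By way of non-limiting illustration, the combination of indicators described above (prominent placement, apparent obfuscation and a suspicious quantity of text) might be sketched as a simple scoring rule. The class `ImageFindings`, its field names, the equal weighting of indicators, and the thresholds `text_threshold` and `min_score` are all assumptions introduced for this sketch and are not specified by the embodiments described herein.

```python
from dataclasses import dataclass


@dataclass
class ImageFindings:
    """Hypothetical summary of the analysis of one embedded image."""
    near_top_of_message: bool   # image likely visible in a preview pane
    abnormal_factors: int       # count of apparent obfuscation signs
    text_block_ratio: float     # share of image blocks that look like text


def spam_score(findings, text_threshold=0.3):
    """Count how many intention indicators are present (0 to 3)."""
    score = 0
    if findings.near_top_of_message:
        score += 1
    if findings.abnormal_factors > 0:
        score += 1
    if findings.text_block_ratio >= text_threshold:
        score += 1
    return score


def looks_like_image_spam(findings, min_score=2):
    """Flag the message when enough indicators agree (assumed rule)."""
    return spam_score(findings) >= min_score


suspicious = ImageFindings(near_top_of_message=True,
                           abnormal_factors=2,
                           text_block_ratio=0.5)
benign = ImageFindings(near_top_of_message=False,
                       abnormal_factors=0,
                       text_block_ratio=0.05)

suspicious_flagged = looks_like_image_spam(suspicious)
benign_flagged = looks_like_image_spam(benign)
```

Requiring agreement among several indicators, rather than relying on any single one, reflects the observation that legitimate email may also contain embedded images; the particular weighting shown is only one of many possible choices.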

Importantly, although various embodiments of the anti-spam detection module and image scanning methodologies are discussed in the context of an email security system, they are equally applicable to network gateways, email appliances, client workstations, servers and other virtual or physical network devices or appliances that may be logically interposed between client workstations and servers, such as firewalls, network security appliances, email security appliances, virtual private network (VPN) gateways, switches, bridges, routers and like devices through which email messages flow. Furthermore, the anti-spam techniques described herein are equally applicable to instant messages, Multimedia Message Service (MMS) messages and other forms of electronic communications in the event that such messages become vulnerable to image spam in the future.

Additionally, various embodiments of the present invention are described with reference to filtering of incoming email messages. However, it is to be understood that the filtering methodologies described herein are equally applicable to email messages originated within an enterprise and circulated internally or outgoing email messages intended for recipients outside of the enterprise. Therefore, the specific examples presented herein are not intended to be limiting and are merely representative of exemplary functionality.

Furthermore, while, for convenience, various embodiments of the present invention may be described with reference to detecting image spam in the graphic/image file formats currently most prevalent (i.e., Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG) and Portable Network Graphics (PNG) graphic/image file formats), embodiments of the present invention are not so limited and are equally applicable to various other current and future graphic/image file formats, including, but not limited to, Progressive Graphics File (PGF), Tagged Image File Format (TIFF), bit mapped format (BMP), HDP, WDP, XPM, MacOS-PICT, Irix-RGB, Multiresolution Seamless Image Database (MrSID), RAW formats used by various digital cameras, various vector formats, such as Scalable Vector Graphics (SVG), as well as other file formats of attachments which may themselves contain embedded images, such as Portable Document Format (PDF), Encapsulated PostScript, SWF, Windows Metafile, AutoCAD DXF and CorelDraw CDR.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the present invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.

Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware, software, firmware and/or by human operators.

Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).

Terminology

Brief definitions of terms used throughout this application are given below.

The term “client” generally refers to an application, program, process or device in a client/server relationship that requests information or services from another program, process or device (a server) on a network. Importantly, the terms “client” and “server” are relative since an application may be a client to one application but a server to another. The term “client” also encompasses software that makes the connection between a requesting application, program, process or device and a server possible, such as an FTP client.

The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling. Thus, for example, two devices may be coupled directly, or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed therebetween, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.

The phrase “embedded image” generally refers to an image that is displayed or rendered inline within a styled or formatted electronic message, such as a HyperText Markup Language (HTML)-based or formatted email message. As used herein, the phrase “embedded image” is intended to encompass scenarios in which the image data is sent with the email message and linked images in which a reference to the image is sent with the email message and the image data is retrieved once the recipient views the email message. The phrase “embedded image” also includes an image that is embedded in other file formats of attachments, such as Portable Document Format (PDF) attachments, in which the image data is displayed to the email recipient when the attachment is viewed.

The phrase “image spam” generally refers to spam in which the “call to action” of the message is partially or completely contained within an embedded file attachment, such as a .gif, .jpeg or .pdf file, rather than in the body of the email message. Such images are typically automatically displayed to the email recipients and typically some form of obfuscation is implemented in an attempt to hide the true content of the image from spam filters.

The phrases “in one embodiment,” “according to one embodiment,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present invention, and may be included in more than one embodiment of the present invention. Importantly, such phrases do not necessarily refer to the same embodiment.

The phrase “network gateway” generally refers to an internetworking system, a system that joins two networks together. A “network gateway” can be implemented completely in software, completely in hardware, or as a combination of the two. Depending on the particular implementation, network gateways can operate at any level of the OSI model from application protocols to low-level signaling.

If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.

The term “proxy” generally refers to an intermediary device, program or agent, which acts as both a server and a client for the purpose of making or forwarding requests on behalf of other clients.

The term “responsive” includes completely or partially responsive.

The term “server” generally refers to an application, program, process or device in a client/server relationship that responds to requests for information or services by another program, process or device (a client) on a network. The term “server” also encompasses software that makes the act of serving information or providing services possible.

The term “spam” generally refers to electronic junk mail, typically bulk electronic mail (email) messages in the form of commercial advertising. Often, email message content may be irrelevant in determining whether an email message is spam, though most spam is commercial in nature. There is spam that fraudulently promotes penny stocks in the classic pump-and-dump scheme. There is spam that promotes religious beliefs. From the recipient's perspective, spam typically represents unsolicited, unwanted, irrelevant, and/or inappropriate email messages, often unsolicited commercial email (UCE). In addition to UCE, spam includes, but is not limited to, email messages regarding or associated with fraudulent business schemes, chain letters, and/or offensive sexual or political messages.

According to one embodiment, “spam” comprises Unsolicited Bulk Email (UBE). Unsolicited generally means the recipient of the email message has not granted verifiable permission for the email message to be sent and the sender has no discernible relationship with all or some of the recipients. Bulk generally refers to the fact that the email message is sent as part of a larger collection of email messages, all having substantively identical content. In embodiments in which spam is equated with UBE, an email message is considered spam if it is both unsolicited and bulk. Unsolicited email can be normal email, such as first contact enquiries, job enquiries, and sales enquiries. Bulk email can be normal email, such as subscriber newsletters, customer communications, discussion lists, etc. Consequently, in such embodiments, an email message would be considered spam if (i) the recipient's personal identity and context are irrelevant because the email message is equally applicable to many other potential recipients; and (ii) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for the email message to be sent.

The phrase “transparent proxy” generally refers to a specialized form of proxy that only implements a subset of a given protocol and allows unknown or uninteresting protocol commands to pass unaltered. Advantageously, as compared to a full proxy in which use by a client typically requires editing of the client's configuration file(s) to point to the proxy, it is not necessary to perform such extra configuration in order to use a transparent proxy.

FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed. In this simple example, spammers 205 are shown coupled to the public Internet 200 to which local area network (LAN) 240 is also communicatively coupled through a firewall 210, a network gateway 215 and an email security system 220, which incorporates within an anti-spam module 225 various novel image spam detection methodologies that are described further below.

In the present example, the email security system 220 is logically interposed between spammers 205 and the email server 230 to perform spam filtering on incoming email messages from the public Internet 200 prior to receipt and storage on the email server 230 from which and through which client workstations 260 residing on the LAN 240 may retrieve and send email correspondence.

In the exemplary network architecture of FIG. 2, the firewall 210 may represent a hardware or software solution configured to protect the resources of LAN 240 from outsiders and to control what outside resources local users have access to by enforcing security policies. The firewall 210 may filter or disallow unauthorized or potentially dangerous material or content from entering LAN 240 and may otherwise limit access between the LAN 240 and the public Internet 200 in accordance with local security policy established and maintained by an administrator of LAN 240.

In one embodiment, the network gateway 215 acts as an interface between the LAN 240 and the public Internet 200. The network gateway 215 may, for example, translate between dissimilar protocols used internally and externally to the LAN 240. Depending upon the distribution of functionality, the network gateway 215 or the firewall 210 may perform network address translation (NAT) to hide private Internet Protocol (IP) addresses used within LAN 240 and enable multiple client workstations, such as client workstations 260, to access the public Internet 200 using a single public IP address.

According to one embodiment, the email security system 220 performs email filtering to detect, tag, block and/or remove unwanted spam and malicious attachments. In one embodiment, an anti-spam module 225 of the email security system 220 performs one or more spam filtering techniques, including but not limited to, sender IP reputation analysis and content analysis, such as attachment/content filtering, heuristic rules, deep email header inspection, spam URI real-time blacklists (SURBL), banned word filtering, spam checksum blacklist, forged IP checking, greylist checking, Bayesian classification, Bayesian statistical filters, signature reputation, and/or filtering methods such as FortiGuard-Antispam, access policy filtering, global and user black/white list filtering, spam Real-time Blackhole List (RBL), Domain Name Service (DNS) Block List (DNSBL) and per user Bayesian filtering so that individual users can set their own profiles.

The anti-spam module 225 also performs various novel image spam detection methodologies or spam image analysis scanning based on sender's intention analysis in an attempt to detect, tag, block and/or remove spam presented in the form of one or more images. Examples of the image analysis techniques and the sender's intention analysis methodologies are described in more detail below. Existing email security platforms that exemplify various operational characteristics of the email security system 220 according to an embodiment of the present invention include the FortiMail™ family of high-performance, multi-layered email security platforms, including the FortiMail-100 platform, the FortiMail-400 platform, the FortiMail-2000 platform and the FortiMail-4000A platform, all of which are available from Fortinet, Inc. of Sunnyvale, Calif.

FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system 320 with a client workstation 360 and an email server 330 in accordance with an embodiment of the present invention.

While in this simplified example, only a single client workstation, i.e., client workstation 360, and a single email server, i.e., email server 330, are shown interacting with the email security system 320, it should be understood that many local and/or remote client workstations, servers and email servers may interact directly or indirectly with the email security system 320 and directly or indirectly with each other.

According to the present example, the email security system 320, which may be implemented as one or more virtual or physical devices, includes a content processor 321, logically interposed between sources of inbound email 380 and an enterprise's email server 330. In the context of the present example, the content processor 321 performs scanning of inbound email messages 380 originating from sources on the public Internet 200 before allowing such inbound email messages 380 to be stored on the email server 330. In one embodiment, an anti-spam module 325 of the content processor 321 may perform spam filtering and an anti-virus (AV) module 326 implementing AV and other filters potentially performs other traditional anti-virus detection and content filtering on data associated with the email messages.

In the current example, anti-spam module 325 may apply various image analysis methodologies described further below to ascertain email senders' intentions and therefore the likelihood that attached and/or embedded images represent image spam. According to the current example, the anti-spam module 325, responsive to being presented with an inbound email message, determines whether the email message contains embedded or attached images and if so, as described further below with reference to FIG. 5 and FIG. 6, determines if such images represent image spam.

In one embodiment, the content processor 321 is an integrated FortiASIC™ Content Processor chip developed by Fortinet, Inc. of Sunnyvale, Calif. In alternative embodiments, the content processor 321 may be a dedicated coprocessor or software to help offload content filtering tasks from a host processor (not shown).

In alternative embodiments, the anti-spam module 325 may be associated with or otherwise responsive to a mail transfer protocol proxy (not shown). The mail transfer protocol proxy may be implemented as a transparent proxy that implements handlers for Simple Mail Transfer Protocol (SMTP) or Extended SMTP (ESMTP) commands/replies relevant to the performance of content filtering activities and passes through those not relevant to the performance of content filtering activities. In one embodiment, the mail transfer protocol proxy may subject each of incoming email, outgoing email and internal email to scanning by the anti-spam module 325 and/or the content processor 321.
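By way of non-limiting illustration, the pass-through behavior of such a transparent proxy might be sketched as a small command dispatcher. The sketch operates on individual command lines only and omits sockets, replies and message bodies; the names `make_transparent_proxy` and `on_data`, and the choice of `DATA` as the sole handled command, are assumptions made for illustration.

```python
def make_transparent_proxy(handlers):
    """Build a line processor from a {VERB: handler} mapping.

    Commands with a registered handler are given to that handler;
    all other commands are forwarded unaltered.
    """
    def process(line):
        # The SMTP verb is the first whitespace-delimited token.
        verb = line.split(" ", 1)[0].strip().upper()
        handler = handlers.get(verb)
        if handler is None:
            return line  # unknown or uninteresting: pass through as-is
        return handler(line)
    return process


inspected = []


def on_data(line):
    """Hypothetical hook: note that a message body is about to follow."""
    inspected.append(line)
    return line


proxy = make_transparent_proxy({"DATA": on_data})

forwarded_noop = proxy("NOOP\r\n")  # no handler, forwarded unchanged
forwarded_data = proxy("DATA\r\n")  # handled, then forwarded
```

Because only the commands of interest have handlers, the proxy need not implement the full protocol, which is consistent with the definition of a transparent proxy given above.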

Notably, filtering of email need not be performed prior to storage of email messages on the email server 330. In alternative embodiments, the content processor 321, the mail transfer protocol proxy (not shown) or some other functional unit logically interposed between a user agent or email client 361 executing on the client workstation 360 and the email server 330 may process email messages at the time they are requested to be transferred from the user agent/email client 361 to the email server 330 or vice versa. Meanwhile, neither the email messages nor their attachments need be stored locally on the email security system 320 to support the filtering functionality described herein. For example, instead of the anti-spam processing running responsive to a mail transfer protocol proxy, the email security system 320 may open a direct connection between the email client 361 and the email server 330, and filter email in real-time as it passes through.

While in the context of the present example, the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy (not shown) have been described as residing within or as part of the same network device, in alternative embodiments one or more of these functional units may be located remotely from the other functional units. According to one embodiment, the hardware components and/or software modules that implement the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy are generally provided on or distributed among one or more Internet and/or LAN accessible networked devices, such as one or more network gateways, firewalls, network security appliances, email security systems, switches, bridges, routers, data storage devices, computer systems and the like.

In one embodiment, the functionality of one or more of the above-referenced functional units may be merged in various combinations. For example, the content processor 321 may be incorporated within the mail transfer protocol proxy or the anti-spam module 325 may be incorporated within the email server 330 or the email client 361. Moreover, the functional units can be communicatively coupled using any suitable communication method (e.g., message passing, parameter passing, and/or signals through one or more communication paths, etc.). Additionally, the functional units can be physically connected according to any suitable interconnection architecture (e.g., fully connected, hypercube, etc.).

According to embodiments of the invention, the functional units can be any suitable type of logic (e.g., digital logic) for executing the operations described herein. Any of the functional units used in conjunction with embodiments of the invention can include machine-readable media including instructions for performing operations described herein. Machine-readable media include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), etc.

FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized. The computer system 400 may represent or form a part of an email security system, network gateway, firewall, network appliance, switch, bridge, router, data storage devices, server, client workstation and/or other network device implementing one or more of the content processor 321 or other functional units depicted in FIG. 3. According to FIG. 4, the computer system 400 includes one or more processors 405, one or more communication ports 410, main memory 415, read only memory 420, mass storage 425, a bus 430, and removable storage media 440.

The processor(s) 405 may be Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s) or other processors known in the art.

Communication port(s) 410 represent physical and/or logical ports. For example, communication port(s) 410 may be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, or a Gigabit port using copper or fiber. Communication port(s) 410 may be chosen depending on the network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any other network to which the computer system 400 connects.

Communication port(s) 410 may also be the name of the end of a logical connection (e.g., a Transmission Control Protocol (TCP) port or a User Datagram Protocol (UDP) port). For example, communication ports may be one of the Well Known Ports, such as TCP port 25 or UDP port 25 (used for Simple Mail Transfer), assigned by the Internet Assigned Numbers Authority (IANA) for specific uses.

Main memory 415 may be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art.

Read only memory 420 may be any static storage device(s) such as Programmable Read Only Memory (PROM) chips for storing static information such as instructions for processors 405.

Mass storage 425 may be used to store information and instructions. For example, hard disks such as the Adaptec® family of SCSI drives, an optical disc, an array of disks such as a RAID (e.g., the Adaptec family of RAID drives), or any other mass storage devices may be used.

Bus 430 communicatively couples processor(s) 405 with the other memory, storage and communication blocks. Bus 430 may be a PCI/PCI-X or SCSI based system bus depending on the storage devices used.

Optional removable storage media 440 may be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk (DVD)-Read Only Memory (DVD-ROM), Re-Writable DVD and the like.

FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention. Depending upon the particular implementation, the various process and decision blocks described below may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.

At block 510, an email message is analyzed to determine if it contains images. For purposes of the present example, the direction of flow of the email message is not pertinent. As indicated above, the email message may be inbound, outbound or an intra-enterprise email message. In various embodiments, however, the anti-spam processing may be enabled in one direction only or various detection thresholds could be configured differently for different flows.

In any event, in one embodiment, the headers, body and attachments, if any, of the email message at issue are parsed and scanned to identify whether the email message is deemed to contain one or more embedded images. If so, processing continues with block 520. Otherwise, no further image spam analysis is required and processing branches to the end.

At block 520, the email message at issue has been determined to contain one or more embedded images. In the current example, the senders' intention analysis anti-spam processing, therefore, proceeds to calculate the location(s) of the embedded image(s). Images may be embedded in a HyperText Markup Language (HTML) part of an HTML formatted email message, within a MIME document or attached separately. In one embodiment, by parsing the HTML, plain text and/or other Multipurpose Internet Mail Extension (MIME) parts, the displaying line just prior to the images can be identified and thus the approximate displaying location of any embedded images can be calculated.

At block 530, the one or more images are analyzed for indications of one or more abnormal factors. Typically, the abnormal factors are manifestations of a spammer's attempt to obscure text embedded within the one or more images by injecting a variety of noise. In one embodiment, abnormal factors include the presence of one or more of the following characteristics: (i) illegal base64 encoding; (ii) multiple images within one HTML part; (iii) one or more low entropy frames in an animated Graphics Interchange Format (GIF); (iv) illegal chunk data within a Portable Network Graphics (PNG) file; (v) quantities of unsmoothed curves; and (vi) quantities of unsmoothed color blocks.

In one embodiment, illegal base64 encoding can be detected by, among other things, observing illegal characters, such as ‘!’, in the encoded content, such as the HTML formatted message or any part of the MIME document.
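By way of non-limiting illustration, the illegal-character check might be sketched in Python as follows. The function name `find_illegal_base64_chars` is hypothetical, and the decision to tolerate whitespace between encoded lines is an assumption of this sketch rather than a detail of the embodiment.

```python
import string

# Standard base64 alphabet plus the padding character.
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")
# Assumption: line breaks and spaces between encoded lines are tolerated.
ALLOWED_WHITESPACE = set(" \t\r\n")


def find_illegal_base64_chars(encoded: str) -> list[str]:
    """Return the distinct characters that are not legal in base64 text."""
    illegal = {
        ch for ch in encoded
        if ch not in BASE64_ALPHABET and ch not in ALLOWED_WHITESPACE
    }
    return sorted(illegal)
```

A non-empty result would indicate that this abnormal factor is present for the encoded part at issue.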

In one embodiment, the inclusion of multiple images within one HTML part can be detected by parsing the HTML formatted email message and observing more than one image within an HTML part. In the exemplary HTML code excerpt below, the existence of three images within a single table row (<tr> . . . </tr>) reveals an intention on the part of the creator of the email message to display a contiguous image to the email recipient based on the three separate embedded images.

<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
  <title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<title>abovementioned bertie</title>
<div align="center">
[...]
    <tr>
        <td width="33%"> <a href="http://www.lklljjp.biz/vpr6160/"> <img name="apprehension" src="cid:part2.00020108.07020409@72.ca" border="0" height="179" width="184"></a></td>
        <td width="33%"> <a href="http://www.lklljjp.biz/vpr6160/"> <img name="gradate" src="cid:part3.00060308.03010709@72.ca" border="0" height="179" width="184"></a></td>
        <td width="34%"> <a href="http://www.lklljjp.biz/vpr6160/"> <img name="maltreat" src="cid:part4.02080304.00040002@72.ca" border="0" height="179" width="184"></a></td>
    </tr>
[...]
</body>
</html>
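By way of non-limiting illustration, counting the images that share a single table row might be sketched in Python with the standard `html.parser` module. The class name `RowImageCounter` and the helper `max_images_in_one_row` are hypothetical, and treating the table row as the unit of analysis is an assumption drawn from the excerpt above; other embodiments may count images per HTML part.

```python
from html.parser import HTMLParser


class RowImageCounter(HTMLParser):
    """Record how many <img> tags appear inside each <tr> element."""

    def __init__(self):
        super().__init__()
        self.counts = []      # one entry per completed table row
        self._current = None  # running count for the open row, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = 0
        elif tag == "img" and self._current is not None:
            self._current += 1

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.counts.append(self._current)
            self._current = None


def max_images_in_one_row(html: str) -> int:
    """Return the largest number of images found in any single table row."""
    parser = RowImageCounter()
    parser.feed(html)
    parser.close()
    return max(parser.counts, default=0)
```

A result greater than one suggests that separate embedded images are intended to be displayed side by side as one contiguous image.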

The existence of one or more low entropy frames of an animated GIF may be determined on an absolute and/or relative basis. For example, an animated GIF frame may be determined to be low entropy with reference to observed entropy values of normal GIF files, which vary from approximately 0.1 to 5.0. Therefore, in one embodiment, the existence of one or more low entropy frames is confirmed by comparing the entropy values calculated for the frames of the animated GIF at issue to 0.1. If the entropy value calculated for any frame of the animated GIF at issue is less than 0.1, then this abnormal factor is deemed to exist. In other embodiments, one or more frames of the animated GIF file at issue may simply be “low” entropy relative to the other, higher entropy frames. For example, a variation of more than 4.9 between the highest entropy frame and the lowest entropy frame within the animated GIF file at issue may indicate that the lowest entropy frame is low relative to the others.
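By way of non-limiting illustration, the absolute and relative tests might be combined as in the following Python sketch. The function name `has_low_entropy_frame` and its parameter names are hypothetical; the default values 0.1 and 4.9 are taken from the example above, and the per-frame entropy values are assumed to have been computed elsewhere.

```python
def has_low_entropy_frame(frame_entropies, absolute_floor=0.1, max_spread=4.9):
    """Return True if any frame is low entropy on an absolute or relative basis."""
    if not frame_entropies:
        return False
    lowest = min(frame_entropies)
    highest = max(frame_entropies)
    # Absolute test: some frame falls below the floor seen in normal GIFs.
    if lowest < absolute_floor:
        return True
    # Relative test: the spread between frames exceeds the allowed variation.
    return (highest - lowest) > max_spread
```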

Illegal chunk data within a Portable Network Graphics (PNG) file may be observed by evaluating information contained within and/or about the chunks. For example, the length of the chunk and cyclic redundancy checksum (CRC) may be verified against the actual data length and recomputed CRC.
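By way of non-limiting illustration, the length and CRC verification might be sketched in Python as follows. The sketch relies on the PNG chunk layout (a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC computed over the type and data) and on `zlib.crc32` from the standard library. The function name `find_illegal_chunks` and the wording of the returned messages are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_illegal_chunks(png_bytes: bytes) -> list[str]:
    """Describe each chunk whose declared length or CRC is inconsistent."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        return ["missing PNG signature"]
    problems = []
    offset = len(PNG_SIGNATURE)
    while offset < len(png_bytes):
        header = png_bytes[offset:offset + 8]
        if len(header) < 8:
            problems.append(f"truncated chunk header at offset {offset}")
            break
        length, chunk_type = struct.unpack(">I4s", header)
        data_start = offset + 8
        data_end = data_start + length
        # The declared length must fit in the file, with room for the CRC.
        if data_end + 4 > len(png_bytes):
            problems.append(f"{chunk_type!r}: declared length {length} exceeds file size")
            break
        data = png_bytes[data_start:data_end]
        (stored_crc,) = struct.unpack(">I", png_bytes[data_end:data_end + 4])
        # Recompute the CRC over the chunk type and data and compare.
        if zlib.crc32(chunk_type + data) & 0xFFFFFFFF != stored_crc:
            problems.append(f"{chunk_type!r}: CRC mismatch")
        offset = data_end + 4
    return problems
```

An empty list indicates that no illegal chunk data was observed by these two checks.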

Quantities of unsmoothed curves may be detected by evaluating the number of pixels for which the difference between their color and the average color of the surrounding pixels is greater than a threshold.
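By way of non-limiting illustration, the pixel count might be sketched in Python as follows. Several simplifications are assumptions of this sketch: the image is represented as rows of single-channel intensity values rather than full colors, the "surrounding pixels" are taken to be the up to eight immediate neighbors, and the function name `count_unsmoothed_pixels` is hypothetical.

```python
def count_unsmoothed_pixels(gray, threshold):
    """Count pixels that differ from the mean of their neighbors by more than threshold."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    count = 0
    for y in range(height):
        for x in range(width):
            # Collect the immediate neighbors that fall inside the image.
            neighbors = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if not neighbors:
                continue
            average = sum(neighbors) / len(neighbors)
            if abs(gray[y][x] - average) > threshold:
                count += 1
    return count
```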

Quantities of unsmoothed color blocks may be detected by evaluating the number of color blocks for which the difference between their color and the color of the surrounding color blocks is greater than a threshold. Color blocks contain pixels with the same or similar colors.

In one embodiment, rather than simply conveying a binary result (e.g., a zero indicating the absence of the abnormal factor at issue and a one indicating the presence of the abnormal factor at issue), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the abnormal factor is expressed.

At block 540, the quantity of text embedded in the images is measured. In one embodiment, images are converted to a binary representation based on a thresholding technique described in further detail below. In general, thresholding is a simple method of image segmentation. Individual pixels in a grayscale image are marked as “information” pixels if their value is greater than some threshold value, T, (assuming the information content is brighter than the background) and as “background” pixels otherwise. Typically, an information pixel is given a value of “1” while a background pixel is given a value of “0.” Then, a text string measurement algorithm is applied to the binary representation of the portion of the image deemed to contain the information content.

Notably, in one embodiment, rather than considering the quantity of embedded text alone, both the quantity of text and the relative position of such text within an email viewer's preview window, for example, or within the image itself may be taken into consideration. For example, a high spam score could be assigned to a very large image (with a correspondingly smaller percentage of text) in which the text is positioned to occupy the whole preview window.

At block 550, the email message is classified as spam or clean based on the observed characteristics of the embedded image(s), such as image location information, the existence/non-existence of various abnormal factors and the quantity of text determined to exist within the embedded image(s). In one embodiment, the spam/clean classification may be based upon a weighted average of the various observed characteristics.

In one embodiment, each observed characteristic may contribute to the score. Once the score reaches a threshold, the email message may be classified as spam and the further characteristics may not require analysis or observation. The email message is classified as clean if the score is less than the threshold after all the characteristics have been evaluated. In one embodiment, the characteristics may be considered in the following order:

-   Image location information
-   Presence of continuous images
-   Presence of illegal base64 encoding
-   Presence of lower entropy frames in an animated GIF
-   Presence of illegal chunk data of a PNG encoded image
-   Quantities and/or location of text in the images
-   Quantities of unsmoothed curves in the images
-   Quantities of unsmoothed color blocks in the images
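By way of non-limiting illustration, the ordered evaluation with early termination might be sketched in Python as follows. The function name `classify`, the representation of each characteristic as a name paired with a zero-argument callable, and the use of a simple running sum are assumptions of this sketch; the individual scores and the threshold are placeholders rather than values taken from any embodiment.

```python
def classify(checks, threshold):
    """Evaluate checks in order and stop as soon as the score reaches threshold.

    checks: ordered list of (name, callable) pairs; each callable returns
    the contribution of one observed characteristic to the score.
    """
    score = 0.0
    evaluated = []
    for name, check in checks:
        score += check()
        evaluated.append(name)
        if score >= threshold:
            # Later, costlier characteristics need not be evaluated.
            return "spam", score, evaluated
    return "clean", score, evaluated
```

Because each characteristic is wrapped in a callable, the more expensive measurements later in the order are computed only if the earlier ones have not already pushed the score to the threshold.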

In one embodiment, similar to that described above with reference to abnormal factors, rather than making the ultimate spam/clean decision (because the ultimate decision could be made by another component), a “spaminess” score may be generated. For example, rather than simply conveying a binary result (e.g., spam vs. clean), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the email message appeared to contain indications of being spam or the likelihood the email message is spam.

If upon completion of the anti-spam processing described above there is not sufficient data to determine that the email message is spam (e.g., there is insufficient data to determine the sender's intention), then according to one embodiment, more CPU intensive processes, such as OCR, may be invoked. Advantageously, in this manner, most image spam emails can be detected in real-time without compromising performance and more CPU intensive processes are only performed if and when required.

FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention. The steps described below represent the processing of block 540 of FIG. 5 according to one embodiment of the present invention.

As mentioned with reference to FIG. 5, depending upon the particular implementation, the various process and decision blocks described below may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.

At block 610, if the image at issue is color, then it is converted to grayscale to form a grayscale representation, G_(i,j). According to one embodiment, color pixels of the image at issue are converted to grayscale by computing an average or weighted average of the red, green and blue color components. While various conversions may be used, examples of suitable conversion equations include the following:

G_(i,j) = (0.299*r_(i,j) + 0.587*g_(i,j) + 0.114*b_(i,j))/3, 0≦i<x_(max), 0≦j<y_(max)   EQ #1

G_(i,j) = (0.3*r_(i,j) + 0.6*g_(i,j) + 0.1*b_(i,j))/3, 0≦i<x_(max), 0≦j<y_(max)   EQ #2

G_(i,j) = (r_(i,j) + g_(i,j) + b_(i,j))/3, 0≦i<x_(max), 0≦j<y_(max)   EQ #3
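By way of non-limiting illustration, the unweighted conversion of EQ #3 might be sketched in Python as follows. The function name `to_grayscale`, the representation of the image as rows of (r, g, b) tuples with 0 to 255 components, and the use of integer division to keep 8-bit values are assumptions of this sketch.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of gray levels per EQ #3."""
    # Integer division keeps each result in the 0-255 range (an assumption).
    return [[(r + g + b) // 3 for (r, g, b) in row] for row in rgb_image]
```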

At block 620, entropy and threshold values are determined for the grayscale image, G_(i,j). Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image. In connection with calculating the entropy of the grayscale image, an intermediate data structure is built containing an intensity histogram, C^(g). In the context of an 8-bit grayscale image, each pixel may have a value of 0 to 255. Thus, the intensity histogram includes 256 bins, each of which maintains a count of the number of pixels in the grayscale image having that value. An example of an intensity histogram is shown in FIG. 9, which represents an intensity histogram for a grayscale representation of FIG. 8. In one embodiment, entropy is calculated according to:

$\begin{matrix}
{E = -\sum\limits_{g = 0}^{255}\left( \frac{C^{g}}{\sum C^{g}} * \log\left( \frac{C^{g}}{\sum C^{g}} \right) \right)} & {\text{EQ \#4}}
\end{matrix}$

Subject to:

$\begin{matrix}
{C^{g} = \sum\limits_{i = 0}^{x_{\max}}\sum\limits_{j = 0}^{y_{\max}} c_{i,j}^{g}} & {\text{EQ \#5}} \\
{c_{i,j}^{g} = \left\{ \begin{matrix} 1 & {G_{i,j} = g} \\ 0 & \text{otherwise} \end{matrix} \right.,\quad 0 \leq i < x_{\max},\ 0 \leq j < y_{\max}} & {\text{EQ \#6}}
\end{matrix}$
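By way of non-limiting illustration, the histogram and entropy calculations might be sketched in Python as follows. The function names `intensity_histogram` and `entropy` are hypothetical. The base of the logarithm is not stated above; the natural logarithm is assumed here, and empty bins are skipped because their contribution to the sum is taken to be zero.

```python
import math


def intensity_histogram(gray):
    """Count the pixels at each of the 256 gray levels (C^g)."""
    counts = [0] * 256
    for row in gray:
        for value in row:
            counts[value] += 1
    return counts


def entropy(counts):
    """Compute E from a histogram, assuming the natural logarithm."""
    total = sum(counts)
    if total == 0:
        return 0.0
    result = 0.0
    for c in counts:
        if c:  # empty bins contribute nothing
            p = c / total
            result -= p * math.log(p)
    return result
```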

According to one embodiment, a threshold value within the intensity histogram is selected simply by choosing the mean or median value. The rationale for this simple threshold selection is that if the information pixels are brighter than the background, they should also be brighter than the average. However, to compensate for the existence of noise and variability in the background, a more sophisticated approach is to create a histogram of the image pixel intensities and then use the valley point as the threshold, T. This histogram approach assumes that there is some average value for the background and information pixels, but that the actual pixel values have some variation around these average values. In one embodiment, the threshold, T, is calculated by:

T = Max(δ_(i)), 0≦i≦255   EQ #7

Subject to:

δ_(i) = w_(i1) w_(i2) (M_(i1) − M_(i2))², 0≦i≦255   EQ #8

$\begin{matrix}{w_{i\; 1} = {\sum\limits_{g = 0}^{i}C^{g}}} & {{EQ}\mspace{14mu} {\# 9}} \\{w_{i\; 2} = {\sum\limits_{g = {i + 1}}^{255}C^{g}}} & {{EQ}\mspace{14mu} {\# 10}} \\{M_{i\; 1} = \frac{\sum\limits_{g = 0}^{i}{g*C^{g}}}{\sum\limits_{g = 0}^{i}C^{g}}} & {{EQ}\mspace{14mu} {\# 11}} \\{M_{i\; 2} = \frac{\sum\limits_{g = {i + 1}}^{255}{k*C^{g}}}{\sum\limits_{g = {i + 1}}^{255}C^{g}}} & {{EQ}\mspace{14mu} {\# 12}}\end{matrix}$

According to the above example, the gray levels are divided into two groups by i, and w_(i1) and w_(i2) are the total number of pixels in each group while M_(i1) and M_(i2) are the average gray level of each group.
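By way of non-limiting illustration, the threshold selection might be sketched in Python as follows. The function name `select_threshold` is hypothetical. The sketch reads EQ #7 as selecting the gray level i at which δ_(i) is largest, which is an interpretation rather than something stated above, and it skips any split that leaves one group empty because the group average is then undefined.

```python
def select_threshold(counts):
    """Return the gray level i that maximizes w1 * w2 * (M1 - M2)^2."""
    total = sum(counts)
    weighted_total = sum(g * c for g, c in enumerate(counts))
    best_level, best_delta = 0, -1.0
    w1 = 0  # number of pixels at gray levels 0..i
    s1 = 0  # sum of gray levels over those pixels
    for i, c in enumerate(counts):
        w1 += c
        s1 += i * c
        w2 = total - w1
        if w1 == 0 or w2 == 0:
            continue  # one group is empty, so its mean is undefined
        m1 = s1 / w1
        m2 = (weighted_total - s1) / w2
        delta = w1 * w2 * (m1 - m2) ** 2
        if delta > best_delta:
            best_level, best_delta = i, delta
    return best_level
```

For a histogram with two well-separated peaks, the returned level falls at or between the peaks, which corresponds to the valley point described above.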

Notably, there are many existing methods of performing thresholding. Consequently, any other current or future method of performing thresholding may be used depending upon the needs of a particular implementation.

At block 630, thresholding is performed to form a binary representation, B_(i,j), of the grayscale image based on the threshold value selected in block 620. In one embodiment, thresholding is performed in accordance with the following equations:

$\begin{matrix}
{B_{i,j} = \left\{ \begin{matrix} 0 & {G_{i,j} < T} \\ 1 & \text{otherwise} \end{matrix} \right.,\quad 0 \leq i < x_{\max},\ 0 \leq j < y_{\max}} & {\text{EQ \#13}} \\
{B_{i,j}^{\prime} = \left\{ \begin{matrix} B_{i,j} & {\max\left( C^{k} \right) < \partial,\ \max\left( C^{k} \right) < T} \\ {!B_{i,j}} & \text{otherwise} \end{matrix} \right.,\quad 0 \leq k \leq 255} & {\text{EQ \#14}}
\end{matrix}$

where, ∂ is an adjustable parameter.
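By way of non-limiting illustration, the basic thresholding of EQ #13 might be sketched in Python as follows. The function name `binarize` is hypothetical, and the conditional inversion of EQ #14 is deliberately left out of this sketch.

```python
def binarize(gray, threshold):
    """Map each gray level to 0 if it is below threshold and to 1 otherwise."""
    return [[0 if value < threshold else 1 for value in row] for row in gray]
```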

At block 640, the binary image is logically divided into M×N virtual blocks.

At block 650, the M×N virtual blocks are analyzed to quantify the number of text strings. In one embodiment, the text strings in the binary image are quantified in accordance with the following equations:

$\begin{matrix}
{T = \sum\limits_{m = 0}^{M}\sum\limits_{n = 0}^{N} T^{m,n}} & {\text{EQ \#15}}
\end{matrix}$

Subject to:

$\begin{matrix}
{T^{m,n} = \sum\limits_{y_{t} = y_{0}^{m}}^{y_{\max}^{m}}\sum\limits_{y_{b} = y_{t} + 1}^{y_{\max}^{m}} T_{y_{t},y_{b}}^{m,n},\quad y_{0}^{m} = \frac{y_{\max}}{\partial_{0}}\left( m - 1 \right),\ y_{\max}^{m} = y_{0}^{m} + \partial_{0}} & {\text{EQ \#16}} \\
{T_{y_{t},y_{b}}^{m,n} = \left\{ \begin{matrix} 1 & {\partial_{1} > \sum\limits_{i = y_{t}}^{y_{b}} {CB}_{i}^{n} > \partial_{2},\ \sum\limits_{k = x_{0}^{n}}^{x_{\max}^{n}} B_{k,y_{b} + 1} < \partial_{3}} \\ 0 & \text{otherwise} \end{matrix} \right.,\quad x_{0}^{n} = \frac{x_{\max}}{\partial_{0}}\left( n - 1 \right),\ x_{\max}^{n} = x_{0}^{n} + \partial_{0}} & {\text{EQ \#17}} \\
{{CB}_{i}^{n} = \left\{ \begin{matrix} 1 & {\partial_{4} > \sum\limits_{k = x_{0}^{n}}^{x_{\max}^{n}} B_{k,i} > \partial_{5},\ \mathrm{Max}\left( \sum\limits_{k = x_{w}}^{x_{w} + \partial_{6}} B_{k,i} \right) < \partial_{7},\ x_{0}^{n} \leq x_{w} \leq x_{\max}^{n}} \\ 0 & \text{otherwise} \end{matrix} \right.} & {\text{EQ \#18}}
\end{matrix}$

where,

∂₀ . . . ∂₇ are adjustable parameters;

T_(y_(t),y_(b))^(m,n) is the likelihood that the row between y_(t) and y_(b) in the block [m,n] represents text;

CB_(i)^(n) is the likelihood that line [i] is a part of text;

B_(k,i) is the value of pixel [k,i] in the binary image.

Notably, while in the context of the equations presented above, a global thresholding approach is implemented taking into consideration the image as a whole, in alternative embodiments, various forms of local thresholding may be performed that consider groups of blocks or individual blocks to determine the best thresholding approach for such block or blocks.

CONCRETE EXAMPLES

For the sake of illustration, two concrete examples of application of the thresholding and text quantification described above will now be provided with reference to FIG. 7 to FIG. 13.

FIG. 7 is an example of an image spam email message 700 containing an embedded image 710. Typically, such image spam email messages also include text 720 in an attempt to defeat conventional heuristic filters.

FIG. 8 is a grayscale image 810 based on the embedded image 710 of FIG. 7 according to one embodiment of the present invention. According to the flow diagram of FIG. 6, the first step (block 610) is to convert the embedded image 710 to a grayscale representation, G_(i,j). Assuming embedded image 710 of FIG. 7 is a color image having red (r), green (g) and blue (b) color components, after application of one of equations EQ #1, EQ #2, EQ #3 or the like, the grayscale representation, G_(i,j), appears as grayscale image 810.

FIG. 9 is an intensity histogram 900 for the grayscale image 810 of FIG. 8 according to one embodiment of the present invention. According to the flow diagram of FIG. 6, the next step (block 620) is to build an intensity histogram data structure, C^(g), and determine a threshold value for the grayscale image 810. After application of one or more of equations EQ #4, EQ #5, EQ #6, EQ #7, EQ #8, EQ #9, EQ #10, EQ #11, EQ #12 or the like to the grayscale representation, G_(i,j) (grayscale image 810), an intensity histogram data structure, C^(g), results, which appears as intensity histogram 900 when displayed in graphical form. Assuming 256 possible gray levels are represented in grayscale image 810, the intensity histogram 900 graphically illustrates the number of pixel occurrences in grayscale image 810 for each gray level.

Application of the above-referenced equations also results in a threshold value, T, 910, being calculated for grayscale image 810. According to this example, the threshold value 910 is 109.

FIG. 10 is a binary image 1010 resulting from thresholding the grayscale image 810 of FIG. 8 in accordance with an embodiment of the present invention. According to the flow diagram of FIG. 6, the next step (block 630) is to binarize the image by performing thresholding with the calculated threshold value. Application of one or both of equations EQ #13 and EQ #14 or the like causes the binary representation, B_(i,j), to contain a zero for each pixel in which the grayscale representation, G_(i,j), is less than the calculated threshold value, T, and to contain a one for each pixel in which the grayscale representation, G_(i,j), is greater than or equal to the calculated threshold value, T. For purposes of illustration, the result of graphically depicting the binary representation, B_(i,j), in which pixels having a value of one are presented as black and pixels having a value of zero are presented as white, is shown as binary image 1010. As can be seen with reference to FIG. 10, the information content intended to be conveyed to the email recipient, i.e., the various text strings, is now clearly distinguishable from the background.

FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention. According to the flow diagram of FIG. 6, the next steps (blocks 640 and 650) are to logically divide the binary image 1010 into virtual blocks and then separately analyze each block to measure perceived text content. Application of one or more of equations EQ #15, EQ #16, EQ #17, EQ #18 or the like causes the text string count, T, to contain the sum of the text strings detected across all blocks.

In the present example, segmented binary image 1110 contains 28 virtual blocks, examples of which are pointed out with reference numerals 1120 and 1130. According to equations EQ #15, EQ #16, EQ #17 and EQ #18, 23 of the 28 blocks contain a total of 63 text strings. Text strings detected by the algorithm are underlined. Block 1120 is an example of a block that has been determined to contain one or more text strings, i.e., the word “TRADE” 1121. Block 1130 is an example of a block that has been determined not to contain a text string.

FIG. 12 is a grayscale image 1210 based on another exemplary embedded image observed in connection with image spam.

FIG. 13 illustrates an exemplary segmentation into 56 virtual blocks of a binarized image 1310 corresponding to the grayscale image 1210 of FIG. 12 and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention. In the present example, segmented binary image 1310 contains 56 virtual blocks, examples of which are pointed out with reference numerals 1320 and 1330. According to equations EQ #15, EQ #16, EQ #17 and EQ #18, 26 of the 56 blocks contain a total of 40 text strings. Text strings detected by the algorithm are underlined. Block 1320 is an example of a block that has been determined to contain one or more text strings, i.e., the group of letters “ebtEras”. Block 1330 is an example of a block that has been determined not to contain a text string.

Notably, to the extent reverse video or the presentation of light colored (e.g., white) text in the context of a dark (e.g., black) background becomes problematic (see, e.g., the “LEARN MORE” text string embedded within FIG. 13), one approach to detect such text strings would be to apply a local thresholding approach using EQ #14, which would effectively reverse the black and white pixels for the blocks at issue.

While embodiments of the invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention, as described in the claims.

1. A method comprising: measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
2. The method of claim 1, wherein the electronic message comprises an electronic mail (email) message.
3. The method of claim 1, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
4. The method of claim 3, wherein the image processing includes local thresholding.
5. The method of claim 3, wherein the image processing includes global thresholding.
6. The method of claim 1, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
7. The method of claim 3, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
8. The method of claim 3, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
9. The method of claim 3, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.
10-27. (canceled)
28. A computer-readable medium having stored thereon instructions, which when executed by one or more processors cause the one or more processors to perform a method comprising: measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
29. The computer-readable medium of claim 28, wherein the electronic message comprises an electronic mail (email) message.
30. The computer-readable medium of claim 28, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
31. The computer-readable medium of claim 30, wherein the image processing includes local thresholding.
32. The computer-readable medium of claim 30, wherein the image processing includes global thresholding.
33. The computer-readable medium of claim 28, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
34. The computer-readable medium of claim 30, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
35. The computer-readable medium of claim 30, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
36. The computer-readable medium of claim 30, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.